Feature Selection-Ranking Methods in a Very Large Electric Database
Abstract
Feature selection is a crucial activity when knowledge discovery is applied to very large databases, as it reduces dimensionality and therefore the complexity of the problem. Its main objective is to eliminate attributes to obtain a computationally tractable problem without affecting the quality of the solution. Several feature selection methods have been proposed, some of them tested on small academic datasets. In this paper we evaluate different feature selection-ranking methods on a very large real-world database from a Mexican electric energy client-invoice system. Most research on feature selection methods evaluates only accuracy and processing time; here we also report on the amount of discovered knowledge and highlight the issue of the boundary that separates relevant from irrelevant features. The evaluation was done using the Elvira and Weka tools, which integrate and implement state-of-the-art data mining algorithms. Finally, we propose a promising feature selection heuristic based on the experiments performed.
Similar resources
Diagnosis of Heart Disease Based on Meta Heuristic Algorithms and Clustering Methods
Data analysis in cardiovascular diseases is difficult due to the large volume of information. Not all features affect the final results, so it is very important to identify the more effective ones. In this study, feature selection with a binary cuckoo optimization algorithm is implemented to reduce the number of features. According to the results, the most appropriate classification fo...
IFSB-ReliefF: A New Instance and Feature Selection Algorithm Based on ReliefF
The growing use of the Internet and phenomena such as sensor networks has led to an unnecessary increase in the volume of information. Though it has many benefits, it causes problems such as the need for more storage space and better processors, as well as data refinement to remove unnecessary data. Data reduction methods provide ways to select useful data from a large amount of duplicate, incomp...
Best Subset Feature Selection for Massive Mixed-Type Problems
We address the problem of identifying a non-redundant subset of important variables. All modern feature selection approaches including filters, wrappers, and embedded methods experience problems in very general settings with massive mixed-type data, and with complex relationships between the inputs and the target. We propose an efficient ensemble-based approach measuring statistical independenc...
An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models
Variable (feature) selection has attracted much attention in contemporary statistical learning and recent scientific research. This is mainly due to the rapid advancement in modern technology that allows scientists to collect data of unprecedented size and complexity. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a sma...
A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization
Spam is unwanted email that harms communications around the world. It is a growing problem in personal email, so detecting it is essential. Machine learning is very useful for this problem, as its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...
Publication date: 2004